Databricks Data Profiler

Data profiling is the process of examining, analyzing, and summarizing a sample dataset to gain insights into data quality, based on the selected parameters. To add a data profiler stage in the Lazsa Platform, do the following:

  1. Add a data quality stage after the data lake stage.

  2. Add a data profiler node to the data quality stage. Connect the node to and from the data lake, as shown below:

    Add Data Profiler stage

  3. Click the data profiler node and then click Create Job to create a data profiler job.

    Create Data Profiler job

    Note: After you create a job and run it for the first time, the Profiler Result tab becomes visible. If you use the Validate constraint in your profiler job, you can also view the Validated Profiler Result.

  4. Provide the following information for the data profiler job:

    1. Job Name

      • Template - this is automatically selected based on the stages added to the pipeline.

      • Job Name - provide a name for the data profiler job.

      • Node Rerun Attempts - the number of times the job is rerun if it fails. The default value is set at the pipeline level.

    2. Source

      Data Profiler Job Source Config

      • Source - this is automatically selected depending on the type of source added in the pipeline.

      • Datastore - this is automatically selected depending on the configured datastore.

      • Source Format – select either Parquet or Delta table.

      • Choose Base Path – select the path where the source file is located.

      • Records from File – provide the percentage or the number of records to be processed in the data profiler job.

      • Profiler Features – select the constraints on which you want the data profiler job to run; a brief PySpark sketch of these checks follows the list. Choose from:

        • Completeness – checks whether the data fulfills the expectation of comprehensiveness. For example, if a customer name is required, both the first name and the last name must be present for every record. If either of the two is missing, the record is incomplete.

        • Validity – checks whether the data is available in the prescribed format. For example, if the specified format for date of birth is mm:dd:yy and the data is provided in dd:mm:yy format, the record is invalid.

        • Count - provides a count of the following types of values in each column:

          • Distinct - values that are distinct, even if they appear multiple times.

          • Filled - values that are not empty.

          • Null - values that are null.

        • Character Count - provides a count of the following types of values in each column:

          • Numbers - values that contain numbers.

          • Numbers Only - values that are only numeric.

          • Letters - values that contain letters.

          • Letters Only - values that are only alphabetical.

          • Numbers and Letters - values that are alphanumeric.

          • Special Characters - values that contain special characters.

        • Statistical Value - provides details about the following values in the records:

          • Minimum - the minimum value amongst the records.

          • Maximum - the maximum value amongst the records.

          • Mean value - the average of the values in the records.

          • Standard Deviation - measure of variation from the average value.

        • Recommendation - information about columns that are not null.
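
      The constraints above map onto standard Spark aggregations. The following is a minimal, illustrative PySpark sketch of such a profiling pass; the column names, source path, sample fraction, and date pattern are hypothetical placeholders, and the Lazsa data profiler's actual implementation may differ.

        # Illustrative sketch only: hypothetical columns, path, and pattern.
        from pyspark.sql import SparkSession
        import pyspark.sql.functions as F

        spark = SparkSession.builder.getOrCreate()

        df = spark.read.format("delta").load("s3://my-bucket/source-data")  # or "parquet"
        sample = df.sample(fraction=0.10, seed=42)  # "Records from File" as a percentage

        profile = sample.select(
            # Completeness: share of rows where both name parts are present
            (F.count(F.when(F.col("first_name").isNotNull()
                            & F.col("last_name").isNotNull(), 1))
             / F.count(F.lit(1))).alias("name_completeness"),
            # Validity: rows whose date of birth matches the expected mm:dd:yy pattern
            F.count(F.when(F.col("date_of_birth").rlike(r"^\d{2}:\d{2}:\d{2}$"), 1)
                    ).alias("valid_dob_count"),
            # Count: distinct, filled, and null values in a column
            F.countDistinct("customer_id").alias("distinct_ids"),
            F.count("customer_id").alias("filled_ids"),
            F.count(F.when(F.col("customer_id").isNull(), 1)).alias("null_ids"),
            # Statistical values
            F.min("order_amount").alias("min_amount"),
            F.max("order_amount").alias("max_amount"),
            F.mean("order_amount").alias("mean_amount"),
            F.stddev("order_amount").alias("stddev_amount"),
        )
        profile.show()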

    3. Target

      • Target – this is automatically selected depending on the type of target you select in the pipeline.

      • Datastore – this is automatically selected depending on the configured datastores to which you have access.

      • Choose Target Format – select either Parquet or Delta table.

      • Target Folder – select the target folder where you want to store the data profiler job output.

      • Target Path – you can provide an additional folder path. This is appended to the target folder.

      • Audit Tables Path – this path is formed based on the selected folders and is appended with a Data_Profiler_Job_audit_table folder.

      • Final File Path – the final output path is created under two folders – TempProfilerOutput and then ProfilerOutput. One plausible layout of these paths is sketched below.
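
      The field descriptions above suggest how the output locations fit together. The following Python sketch shows one plausible reading; the bucket and folder names are hypothetical placeholders, and the exact layout is determined by the platform.

        # Illustrative sketch only: hypothetical bucket and folder names.
        target_folder = "s3://my-bucket/profiler-target"                      # Target Folder
        target_path = f"{target_folder}/run-01"                               # plus optional Target Path
        audit_tables_path = f"{target_path}/Data_Profiler_Job_audit_table"    # Audit Tables Path
        final_file_path = f"{target_path}/TempProfilerOutput/ProfilerOutput"  # Final File Path

        print(audit_tables_path)
        print(final_file_path)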

    4. Cluster Configuration - select one of the following options:

      • All Purpose Cluster - select a cluster from the dropdown.

      • Job Cluster - create a cluster that is used specifically for this job. Provide the following details; an illustrative sketch of an equivalent Databricks cluster specification appears at the end of this section.

        • Cluster Details

          Choose Cluster - select a cluster.

          Job Configuration Name - provide a name for the configuration.

          Databricks Runtime Version - provide the version of the Databricks Runtime to be used.

          Worker Type - select a configuration for the worker nodes.

          Workers - specify the number of worker nodes required for this job.

          Enable Autoscaling - enable this option and specify the following:

          • Min Workers - minimum number of workers.

          • Max Workers - maximum number of workers.

          Databricks chooses the appropriate number of workers required to run the job from the provided range.

        • Cloud Infrastructure Details

          First on Demand - the number of initial cluster nodes to be placed on on-demand instances; any remaining nodes use the selected availability option.

          Availability - select from the following options:

          • Spot

          • On-demand

          • Spot with fallback

           
          Zone - select a zone from the available options.

          Instance Profile ARN - provide the ARN for the AWS instance on which the Databricks cluster is deployed.

          EBS Volume Type - select the type of EBS volume being used for the AWS instance.

          EBS Volume Count - specify the number of EBS volumes to be used for the AWS instance.

          EBS Volume Size - select the size of the EBS volume.
        • Additional Details

      Spark Config - specify Spark configuration properties for the cluster as key-value pairs.

      Environment Variables - specify environment variables to be set on the cluster nodes.

      Logging Path - select one of the following options:

      • DBFS - provide the path for storing the logs on DBFS.

      • S3 - provide the path for storing logs on S3 and select the region.

      Init Scripts - select one of the following options and provide the required details:

      • Workspace

      • DBFS

      • S3
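
      The job-cluster fields above correspond closely to the new_cluster object in the Databricks Jobs API. The sketch below shows how such a configuration might look if expressed directly against that API; every concrete value is a placeholder, and the Lazsa Platform assembles the actual payload from the form fields for you.

        # Illustrative mapping of the UI fields to a Databricks Jobs API
        # new_cluster object; all values are hypothetical placeholders.
        new_cluster = {
            "spark_version": "13.3.x-scala2.12",       # Databricks Runtime Version
            "node_type_id": "i3.xlarge",               # Worker Type
            "autoscale": {                             # Enable Autoscaling
                "min_workers": 2,                      # Min Workers
                "max_workers": 8,                      # Max Workers
            },
            "aws_attributes": {                        # Cloud Infrastructure Details
                "first_on_demand": 1,
                "availability": "SPOT_WITH_FALLBACK",  # Spot | On-demand | Spot with fallback
                "zone_id": "us-east-1a",               # Zone
                "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/databricks",
                "ebs_volume_type": "GENERAL_PURPOSE_SSD",
                "ebs_volume_count": 1,
                "ebs_volume_size": 100,                # size in GB
            },
            "spark_conf": {"spark.sql.shuffle.partitions": "200"},  # Spark Config
            "spark_env_vars": {"ENVIRONMENT": "dev"},                # Environment Variables
            "cluster_log_conf": {                                    # Logging Path (S3 option)
                "s3": {"destination": "s3://my-bucket/cluster-logs", "region": "us-east-1"}
            },
            "init_scripts": [                                        # Init Scripts (Workspace option)
                {"workspace": {"destination": "/Shared/init/install_libs.sh"}}
            ],
        }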

  5. After you create the job, you can run it in either of the following ways:

    • Publish the pipeline with the changes and then run the pipeline.

    • Click the Data Profiler node, and click Start.

      DQ Start Data Profiler job

  6. Click View Profiler Results to view the results of the Data Profiler job. After viewing the results, you can validate columns based on a specified pattern.

    DQ View Profiler Results

  7. Specify the pattern for validation and click Validate to validate the data in the selected columns based on the provided pattern.

    DQ Validate Profiler Results

    Note: The pipeline must be in Edit mode for the Validate button to be enabled.

  8. Click Start or run the pipeline to run the validation job. Once the job is complete, you can view its results on the Validated Profiler Result tab. You can download the results as a CSV file.
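
    The validation in steps 7 and 8 amounts to checking the selected columns against the pattern you provide. A minimal PySpark sketch of that idea is shown below; the column name, pattern, and paths are hypothetical placeholders, and the platform's own validation job may differ.

      # Illustrative sketch only: hypothetical column, pattern, and paths.
      from pyspark.sql import SparkSession
      import pyspark.sql.functions as F

      spark = SparkSession.builder.getOrCreate()
      df = spark.read.format("delta").load("s3://my-bucket/source-data")

      pattern = r"^\d{3}-\d{3}-\d{4}$"  # the pattern entered in the Validate step

      # Flag each record as valid or invalid for the selected column
      validated = df.withColumn("phone_number_is_valid",
                                F.col("phone_number").rlike(pattern))

      # Summarize the validation outcome
      summary = validated.groupBy("phone_number_is_valid").count()
      summary.show()

      # Persist the summary so it can be downloaded as a CSV file
      summary.coalesce(1).write.mode("overwrite").csv(
          "s3://my-bucket/validated-profiler-result", header=True)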

What's next? Databricks Data Analyzer